Gesture Recognition


Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency

Neural Information Processing Systems

Pre-training on time series poses a unique challenge due to the potential mismatch between pre-training and target domains, such as shifts in temporal dynamics, fast-evolving trends, and long-range and short-cyclic effects, which can lead to poor downstream performance. While domain adaptation methods can mitigate these shifts, most methods need examples directly from the target domain, making them suboptimal for pre-training. To address this challenge, methods need to accommodate target domains with different temporal dynamics and be capable of doing so without seeing any target examples during pre-training. Relative to other modalities, in time series, we expect that time-based and frequency-based representations of the same example are located close together in the time-frequency space. To this end, we posit that time-frequency consistency (TF-C) --- embedding a time-based neighborhood of an example close to its frequency-based neighborhood --- is desirable for pre-training. Motivated by TF-C, we define a decomposable pre-training model, where the self-supervised signal is provided by the distance between time and frequency components, each individually trained by contrastive estimation. We evaluate the new method on eight datasets, including electrodiagnostic testing, human activity recognition, mechanical fault detection, and physical status monitoring. Experiments against eight state-of-the-art methods show that TF-C outperforms baselines by 15.4% (F1 score) on average in one-to-one settings (e.g., fine-tuning an EEG-pretrained model on EMG data) and by 8.4% (precision) in challenging one-to-many settings (e.g., fine-tuning an EEG-pretrained model for either hand-gesture recognition or mechanical fault prediction), reflecting the breadth of scenarios that arise in real-world applications.
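The TF-C objective described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the function names, the squared-distance consistency term, and the weighting `lam` are all assumptions; in the paper each component is trained by contrastive estimation with learned encoders.

```python
import numpy as np

def l2_normalize(z, axis=-1):
    return z / (np.linalg.norm(z, axis=axis, keepdims=True) + 1e-8)

def nt_xent(z_a, z_b, tau=0.2):
    """InfoNCE-style contrastive loss; positives sit on the diagonal."""
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    sim = z_a @ z_b.T / tau                       # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def tfc_loss(z_t, z_t_aug, z_f, z_f_aug, lam=0.5):
    """Time-contrastive + frequency-contrastive terms, plus a consistency
    term pulling each example's time and frequency embeddings together."""
    l_time = nt_xent(z_t, z_t_aug)
    l_freq = nt_xent(z_f, z_f_aug)
    l_consistency = np.mean(
        np.sum((l2_normalize(z_t) - l2_normalize(z_f)) ** 2, axis=1))
    return l_time + l_freq + lam * l_consistency

# toy usage with random "embeddings" standing in for encoder outputs
rng = np.random.default_rng(0)
z_t, z_ta = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
z_f, z_fa = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = tfc_loss(z_t, z_ta, z_f, z_fa)   # scalar, non-negative
```

In practice the time view would come from a temporal encoder and the frequency view from an encoder over the signal's spectrum (e.g. an FFT magnitude), with augmentations applied in each domain.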


A Comparative Study of EMG- and IMU-based Gesture Recognition at the Wrist and Forearm

Baghernezhad, Soroush, Mohammadreza, Elaheh, da Fonseca, Vinicius Prado, Zou, Ting, Jiang, Xianta

arXiv.org Artificial Intelligence

Gestures are an integral part of our daily interactions with the environment. Hand gesture recognition (HGR) is the process of interpreting human intent through various input modalities, such as visual data (images and videos) and bio-signals. Bio-signals are widely used in HGR due to their ability to be captured non-invasively via sensors placed on the arm. Among these, surface electromyography (sEMG), which measures the electrical activity of muscles, is the most extensively studied modality. However, less-explored alternatives such as inertial measurement units (IMUs) can provide complementary information on subtle muscle movements, which makes them valuable for gesture recognition. In this study, we investigate the potential of using IMU signals from different muscle groups to capture user intent. Our results demonstrate that IMU signals contain sufficient information to serve as the sole input sensor for static gesture recognition. Moreover, we compare different muscle groups and assess the quality of pattern recognition achievable from each individually. We further found that tendon-induced micro-movement captured by IMUs is a major contributor to static gesture recognition. We believe that leveraging muscle micro-movement information can enhance the usability of prosthetic arms for amputees. This approach also offers new possibilities for hand gesture recognition in fields such as robotics, teleoperation, sign language interpretation, and beyond.


Science confirms hand gestures make you seem more persuasive

Popular Science

A recently published study suggests something many Italians already knew: certain hand gestures really do make people seem more competent and persuasive. "One of the key takeaways for marketers is that you can use the same content, but if you pay more attention to how that content is delivered, it could have a big impact on persuasiveness," Mi Zhou, study co-author and digital-marketing researcher at the University of British Columbia, said in a statement. Zhou and her colleagues analyzed 2,184 TED Talks using AI and automated video analysis. They compared hundreds of thousands of video clips of hand features to audience engagement metrics, and asked study participants to rate the speakers and products in videos of sales pitches with different hand movements.


SASG-DA: Sparse-Aware Semantic-Guided Diffusion Augmentation For Myoelectric Gesture Recognition

Liu, Chen, Han, Can, Xu, Weishi, Wang, Yaqi, Qian, Dahong

arXiv.org Artificial Intelligence

Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two critical factors for effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism, which leverages fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Sampling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling (SASS) strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples. Gesture recognition serves as a fundamental technology for advancing human-machine interaction.
Among various gesture recognition modalities, surface electromyography (sEMG)-based approaches have gained increasing attention due to their non-invasive nature, high temporal resolution, and ability to directly capture muscle activation signals associated with voluntary movement [1].
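As a rough illustration of the GMSS and SASS ideas above, one can fit a Gaussian over per-class semantic representations, sample conditions stochastically, and bias sampling toward sparse regions. This is a NumPy sketch under stated assumptions: the diffusion generator that would consume these conditions is omitted, and the function names and the nearest-neighbor sparsity criterion are illustrative, not the paper's exact mechanism.

```python
import numpy as np

def fit_gaussian(features):
    """Fit a Gaussian over semantic representations (the GMSS idea)."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    return mu, cov

def sample_conditions(mu, cov, n, rng):
    """Stochastically sample semantic conditions for a diffusion generator."""
    return rng.multivariate_normal(mu, cov, size=n)

def sparse_aware_sample(features, mu, cov, n, rng, pool=200):
    """Sparse-aware variant (the SASS idea): oversample candidates, then keep
    those farthest from the real features to cover underrepresented regions."""
    cand = sample_conditions(mu, cov, pool, rng)
    # distance from each candidate to its nearest real feature vector
    d = np.min(np.linalg.norm(cand[:, None, :] - features[None, :, :], axis=-1),
               axis=1)
    return cand[np.argsort(d)[-n:]]   # keep the n sparsest candidates

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))            # toy semantic representations
mu, cov = fit_gaussian(feats)
synthetic = sparse_aware_sample(feats, mu, cov, n=10, rng=rng)  # (10, 8)
```

Each sampled condition would then guide one denoising run of the diffusion model to produce a synthetic sEMG sample.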


Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks

Shibata, Koki, Ling, Tianheng, Qian, Chao, Matsui, Tomokazu, Suwa, Hirohiko, Yasumoto, Keiichi, Schiele, Gregor

arXiv.org Artificial Intelligence

These authors contributed equally to this work. The growing demand for smart home interfaces has increased interest in non-intrusive sensing methods like vibration-based gesture recognition. While prior studies demonstrated feasibility, they often rely on complex preprocessing and large Neural Networks (NNs) requiring costly high-performance hardware, resulting in high energy usage and limited real-world deployability. This study proposes an energy-efficient solution that deploys compact NNs on low-power Field-Programmable Gate Arrays (FPGAs) to enable real-time gesture recognition with competitive accuracy. We adopt a series of optimizations: (1) we replace complex spectral preprocessing with raw waveform input, eliminating on-board preprocessing while reducing input size by 21× without sacrificing accuracy. A ping-pong buffering mechanism in 1D-SepCNN further improves deployability under tight memory constraints. Evaluated on two swipe-direction datasets with multiple users and ordinary tables, our approach achieves low-latency, energy-efficient inference on the AMD Spartan-7 XC7S25 FPGA. Under the PS data-splitting setting, the selected 6-bit 1D-CNN reaches 0.970 average accuracy across users with 9.22 ms latency. The chosen 8-bit 1D-SepCNN further reduces latency to 6.83 ms (an over-53× speedup over a CPU baseline) with slightly lower accuracy (0.949). Both consume under 1.2 mJ per inference, demonstrating suitability for long-term edge operation.
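The two building blocks above, low-bit quantization and a depthwise-separable 1D convolution over the raw waveform, can be sketched as follows. This is a minimal NumPy illustration of the general techniques, not the paper's FPGA implementation; the function names, filter shapes, and symmetric quantization scheme are assumptions.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform weight quantization to the given bit width
    (e.g. 6 or 8 bits, as in the deployed models)."""
    m = np.max(np.abs(w))
    if m == 0:
        return w
    scale = m / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def sep_conv1d(x, dw, pw):
    """Depthwise-separable 1D convolution: a per-channel temporal filter (dw)
    followed by a 1x1 pointwise channel mix (pw).
    x: (C, T), dw: (C, K), pw: (C_out, C) -> output (C_out, T - K + 1)."""
    C, _ = x.shape
    depth = np.stack([np.convolve(x[c], dw[c], mode="valid") for c in range(C)])
    return pw @ depth

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 64))                   # raw 3-axis waveform window
dw = quantize(rng.normal(size=(3, 5)), bits=6) # quantized depthwise filters
pw = quantize(rng.normal(size=(4, 3)), bits=8) # quantized pointwise weights
y = sep_conv1d(x, dw, pw)                      # (4, 60) feature map
```

The depthwise-separable factorization is what makes the model small enough for the ping-pong buffering scheme to stream activations through limited on-chip memory.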


SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks

Park, Junyong, Levy, Oron, Adaimi, Rebecca, Liberman, Asaf, Laput, Gierad, Bedri, Abdelkareem

arXiv.org Artificial Intelligence

Wearable accelerometers are used for a wide range of applications, such as gesture recognition, gait analysis, and sports monitoring. Yet most existing foundation models focus primarily on classifying common daily activities such as locomotion and exercise, limiting their applicability to the broader range of tasks that rely on other signal characteristics. SlotFM uses Time-Frequency Slot Attention, an extension of Slot Attention that processes both time and frequency representations of the raw signals. It generates multiple small embeddings (slots), each capturing different signal components, enabling task-specific heads to focus on the most relevant parts of the data. We also introduce two loss regularizers that capture local structure and frequency patterns, which improve reconstruction of fine-grained details and help the embeddings preserve task-relevant information. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition. It outperforms existing self-supervised approaches on 13 of these tasks and achieves comparable results to the best performing approaches on the remaining tasks. On average, our method yields a 4.5% performance gain, demonstrating strong generalization for sensing foundation models. Advances in self-supervised learning (SSL) and large-scale datasets have enabled foundation models that support multiple tasks through shared representations (Yang et al., 2024; Oquab et al., 2023). This is particularly valuable for wearable devices, where maintaining separate models dedicated to each task is often impractical due to memory and compute constraints. Accelerometers are widely used sensors in wearables for diverse motion-related tasks. Recent studies show that SSL approaches can train foundation models effective in Human Activity Recognition (HAR) tasks such as exercise and locomotion classification (Logacjov, 2024).
However, their applicability to broader accelerometer tasks, such as gait analysis and gesture recognition, remains largely unexplored. This contrasts with domains such as audio, where foundation models have been applied beyond a single task, spanning speech-to-text, speaker identification, and emotion recognition.
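The slot mechanism at the heart of SlotFM can be sketched in a few lines. This NumPy version shows only the competition step of Slot Attention, where slots bid for input tokens via attention normalized over slots and then update as weighted means; the learned projections, GRU update, and the paper's specific time-frequency tokenization are omitted and all names here are illustrative.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, n_slots=4, iters=3, rng=None):
    """Minimal Slot Attention: slots compete for input tokens (softmax over
    slots, not tokens), then update as the weighted mean of their tokens.
    inputs: (N, D) tokens, e.g. time-frame and frequency-frame features."""
    if rng is None:
        rng = np.random.default_rng(0)
    _, D = inputs.shape
    slots = rng.normal(size=(n_slots, D))
    for _ in range(iters):
        attn = softmax(inputs @ slots.T / np.sqrt(D), axis=1)   # (N, S)
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # weighted mean
        slots = attn.T @ inputs                                  # (S, D)
    return slots

rng = np.random.default_rng(1)
tokens = rng.normal(size=(32, 16))   # toy stand-in for time/frequency tokens
slots = slot_attention(tokens)       # (4, 16): one small embedding per slot
```

Because the softmax is taken over slots rather than tokens, each token's attention mass is divided among slots, which is what encourages different slots to specialize on different signal components.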


A Bimanual Gesture Interface for ROS-Based Mobile Manipulators Using TinyML and Sensor Fusion

Bhuiyan, Najeeb Ahmed, Huq, M. Nasimul, Chowdhury, Sakib H., Mangharam, Rahul

arXiv.org Artificial Intelligence

Gesture-based control for mobile manipulators faces persistent challenges in reliability, efficiency, and intuitiveness. This paper presents a dual-hand gesture interface that integrates TinyML, spectral analysis, and sensor fusion within a ROS framework to address these limitations. The system uses left-hand tilt and finger flexion, captured using accelerometer and flex sensors, for mobile base navigation, while right-hand IMU signals are processed through spectral analysis and classified by a lightweight neural network. This pipeline enables TinyML-based gesture recognition to control a 7-DOF Kinova Gen3 manipulator. By supporting simultaneous navigation and manipulation, the framework improves efficiency and coordination compared to sequential methods. Key contributions include a bimanual control architecture, real-time low-power gesture recognition, robust multimodal sensor fusion, and a scalable ROS-based implementation. The proposed approach advances Human-Robot Interaction (HRI) for industrial automation, assistive robotics, and hazardous environments, offering a cost-effective, open-source solution with strong potential for real-world deployment and further optimization.
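The two halves of the pipeline described above, spectral features from the right-hand IMU for the TinyML classifier and a direct mapping from left-hand tilt and flexion to base motion, can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the feature layout, the 45-degree tilt normalization, and the flexion-to-steering mapping are all assumptions.

```python
import numpy as np

def spectral_features(imu_window, fs=100):
    """FFT-magnitude features from a right-hand IMU window, as might feed a
    TinyML gesture classifier. imu_window: (T, 3) accelerometer axes."""
    mags = np.abs(np.fft.rfft(imu_window, axis=0))   # (T//2 + 1, 3)
    return mags.flatten()

def base_command(tilt_deg, flex_norm, max_lin=0.3, max_ang=1.0):
    """Map left-hand tilt (forward/back, degrees) and finger flexion (0..1)
    to a differential-drive style (linear, angular) velocity command."""
    linear = max_lin * np.clip(tilt_deg / 45.0, -1.0, 1.0)
    angular = max_ang * (2.0 * flex_norm - 1.0)   # flexion steers left/right
    return linear, angular

feats = spectral_features(np.zeros((64, 3)))   # 33 bins x 3 axes = 99 features
lin, ang = base_command(45, 0.5)               # full tilt, neutral flexion
```

In a ROS setup, `base_command` would populate a `geometry_msgs/Twist` for the mobile base while the classifier's gesture label is mapped to manipulator commands, which is what allows navigation and manipulation to run simultaneously.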


Gesture-Based Robot Control Integrating Mm-wave Radar and Behavior Trees

Song, Yuqing, Tonola, Cesare, Savazzi, Stefano, Kianoush, Sanaz, Pedrocchi, Nicola, Sigg, Stephan

arXiv.org Artificial Intelligence

As robots become increasingly prevalent in both homes and industrial settings, the demand for intuitive and efficient human-machine interaction continues to rise. Gesture recognition offers an intuitive control method that does not require physical contact with devices and can be implemented using various sensing technologies. Wireless solutions are particularly flexible and minimally invasive. While camera-based vision systems are commonly used, they often raise privacy concerns and can struggle in complex or poorly lit environments. In contrast, radar sensing preserves privacy, is robust to occlusions and lighting, and provides rich spatial data such as distance, relative velocity, and angle. We present a gesture-controlled robotic arm using mm-wave radar for reliable, contactless motion recognition. Nine gestures are recognized and mapped to real-time commands with precision. Case studies are conducted to demonstrate the system practicality, performance and reliability for gesture-based robotic manipulation. Unlike prior work that treats gesture recognition and robotic control separately, our system unifies both into a real-time pipeline for seamless, contactless human-robot interaction.
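The unification of recognition and control hinted at in the title can be illustrated with a tiny behavior tree that routes recognized gestures to robot commands. This is a generic sketch of the behavior-tree pattern, assuming nothing about the paper's tree; the gesture names and command strings are invented placeholders.

```python
# Minimal behavior-tree sketch: a Selector ticks children until one succeeds;
# each condition node gates on the last recognized gesture and, on success,
# writes the corresponding command to a shared blackboard.
class Selector:
    def __init__(self, *children):
        self.children = children

    def tick(self, bb):
        return any(child.tick(bb) for child in self.children)

class GestureIs:
    def __init__(self, gesture, action):
        self.gesture, self.action = gesture, action

    def tick(self, bb):
        if bb.get("gesture") == self.gesture:
            bb["command"] = self.action
            return True
        return False

# hypothetical gesture-to-command mapping for a robotic arm
tree = Selector(
    GestureIs("swipe_left",  "arm.move(-0.1, 0, 0)"),
    GestureIs("swipe_right", "arm.move(+0.1, 0, 0)"),
    GestureIs("fist",        "arm.stop()"),
)

bb = {"gesture": "fist"}   # e.g. output of the radar gesture classifier
tree.tick(bb)              # sets bb["command"] = "arm.stop()"
```

Ticking the tree each time the radar classifier emits a gesture keeps recognition and control in one real-time loop, rather than treating them as separate stages.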


Technical Perspective: NeuroRadar: Can Radar Systems Be Reimagined Using Computational Principles?

Communications of the ACM

Interest in miniature radar systems has grown dramatically in recent years as they enable rich interaction and health monitoring in everyday settings. By 2025, industrial radar applications are anticipated to encompass 10 million devices, whereas the consumer market will reach a substantial 250 million. The applications are diverse--for example, Google's Pixel phones incorporated radar for gesture control, while small radar sensors are being deployed in homes to monitor elderly residents' movements and detect falls, offering more privacy than camera-based solutions. However, conventional radar architectures rely on complex RF front ends with power amplifiers, low-noise amplifiers, and phase-locked loops, collectively consuming hundreds of milliwatts of power. This makes radar sensing impractical for battery-powered or self-powered Internet of Things (IoT) devices and wearables.


Improving Tactile Gesture Recognition with Optical Flow

Zhong, Shaohong, Albini, Alessandro, Caroleo, Giammarco, Cannata, Giorgio, Maiolino, Perla

arXiv.org Artificial Intelligence

Tactile gesture recognition systems play a crucial role in Human-Robot Interaction (HRI) by enabling intuitive communication between humans and robots. The literature mainly addresses this problem by applying machine learning techniques to classify sequences of tactile images encoding the pressure distribution generated when executing the gestures. However, some gestures can be hard to differentiate based on the information provided by tactile images alone. In this paper, we present a simple yet effective way to improve the accuracy of a gesture recognition classifier. Our approach focuses solely on processing the tactile images used as input by the classifier. In particular, we propose to explicitly highlight the dynamics of the contact in the tactile image by computing the dense optical flow. This additional information makes it easier to distinguish between gestures that produce similar tactile images but exhibit different contact dynamics. We validate the proposed approach in a tactile gesture recognition task, showing that a classifier trained on tactile images augmented with optical flow information achieved a 9% improvement in gesture classification accuracy compared to one trained on standard tactile images.
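The augmentation described above, adding flow channels to the pressure image, can be sketched as follows. This NumPy version uses a crude brightness-constancy (normal-flow) estimate purely for illustration; the paper computes dense optical flow with a proper algorithm (e.g. OpenCV's Farneback method), and the function names and channel layout here are assumptions.

```python
import numpy as np

def dense_flow(prev, curr, eps=1e-6):
    """Crude dense flow between consecutive tactile frames from the
    brightness-constancy constraint; a stand-in for a real flow algorithm."""
    Iy, Ix = np.gradient(prev.astype(float))      # spatial gradients
    It = curr.astype(float) - prev.astype(float)  # temporal difference
    denom = Ix**2 + Iy**2 + eps
    u = -It * Ix / denom   # horizontal flow component
    v = -It * Iy / denom   # vertical flow component
    return u, v

def augment_with_flow(prev, curr):
    """Stack the pressure image with flow channels as classifier input."""
    u, v = dense_flow(prev, curr)
    return np.stack([curr.astype(float), u, v], axis=-1)   # (H, W, 3)

rng = np.random.default_rng(0)
prev = rng.random((8, 8))              # toy tactile pressure frames
curr = np.roll(prev, 1, axis=1)        # simulated contact sliding sideways
stacked = augment_with_flow(prev, curr)
```

The extra channels make contact dynamics explicit, so two gestures with near-identical pressure maps but different motion directions become separable for the classifier.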